Capstone Project - RSNA Pneumonia Detection Challenge

Group 6-CV 1:

Contributors - Ashish Tiwari, Kanishk Sanger, Ninad Mahajan, Tushar Bisht, Avinash Kumra Sharma

Mentor - Mr. Rohit Raj

RSNA Pneumonia Detection

Pneumonia is an infection in one or both lungs. Bacteria, viruses, and fungi cause it. The infection causes inflammation in the air sacs in your lungs, which are called alveoli.

Chest X-rays (CXRs) are the most commonly performed diagnostic imaging study. A number of factors, such as the positioning of the patient and the depth of inspiration, can alter the appearance of the CXR, complicating interpretation further. In addition, clinicians are faced with reading high volumes of images every shift.

The aim is to automate pneumonia screening in chest radiographs, providing details of the affected areas through bounding boxes.

Objective:

The objective of this project is to build an algorithm to locate the position of inflammation in a medical image. The algorithm needs to locate lung opacities on chest radiographs automatically.

The objectives of the project are:

  1. Learn how to build an object detection model.
  2. Use transfer learning to fine-tune a model.
  3. Learn to set the optimizers, loss functions, epochs, learning rate, batch size, checkpointing, early stopping, etc.
  4. Read research papers in the domain to learn about advanced models for the given problem.

Acknowledgment for the datasets: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/overview/acknowledgements

Importing and installing the necessary Libraries

Enable the package installations below if they are not already installed
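As a sketch, the setup cell might look like the following; the package names are assumed from the libraries used later in the notebook, and pydicom is imported defensively since it is the one package least likely to be pre-installed:

```python
# Uncomment if the packages are not already installed:
# !pip install pydicom pandas numpy matplotlib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

try:
    import pydicom  # needed later to read the DICOM image files
    HAS_PYDICOM = True
except ImportError:
    HAS_PYDICOM = False
```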

EDA and Visualization

  1. Importing Data
  2. Analysing the dimensions of data
  3. Visualizing the data

The input folder contains four important items:

stage_2_train_labels.csv - CSV file containing the patient id, bounding boxes and target label

stage_2_detailed_class_info.csv - CSV file containing detailed information about each patientId and the corresponding class label

stage_2_train_images - directory contains train images in DICOM format

stage_2_test_images - directory contains test images in DICOM format
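A minimal loader for the two tables can be sketched as below, assuming the files follow the Kaggle directory layout described above:

```python
import pandas as pd
from pathlib import Path

def load_rsna_tables(input_dir):
    """Read the train-labels and detailed-class-info CSVs from the input folder."""
    input_dir = Path(input_dir)
    labels = pd.read_csv(input_dir / "stage_2_train_labels.csv")
    class_info = pd.read_csv(input_dir / "stage_2_detailed_class_info.csv")
    return labels, class_info
```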

The detailed class info dataset gives detailed information about the type of class associated with each patientId. It has three classes: "Lung Opacity", "Normal", and "No Lung Opacity / Not Normal".

The train labels CSV file contains the patientId, the bounding box details ((x, y) coordinates plus the width and height that define the box), and the Target variable. For Target 0, the bounding box values are NaN.

If we look closely, there are duplicate patientId entries in the CSV files. We can observe that rows #4 and #5, and rows #8 and #9, have the same patientId values, i.e., the patient is identified with pneumonia in multiple areas of the lungs.

Check the unique patient IDs in the train dataset

Checking missing data in the two datasets
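The two checks can be sketched as follows, using an illustrative stand-in DataFrame with the same schema as stage_2_train_labels.csv (the values are made up):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for stage_2_train_labels.csv.
labels = pd.DataFrame({
    "patientId": ["p1", "p2", "p2", "p3"],
    "x":      [np.nan, 100.0, 300.0, np.nan],
    "y":      [np.nan, 150.0, 350.0, np.nan],
    "width":  [np.nan, 200.0, 180.0, np.nan],
    "height": [np.nan, 250.0, 210.0, np.nan],
    "Target": [0, 1, 1, 0],
})

# Unique patients: duplicates arise when one patient has several boxes.
n_unique = labels["patientId"].nunique()

# Missing values per column: NaNs appear only where Target == 0.
missing = labels.isnull().sum()
```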

From the info of the data, we observe that of the total 30227 rows, 9555 rows have non-null bounding-box values. So each bounding box is either fully defined or fully missing.

We see from above that the total number of rows identified with pneumonia is 9555, which matches the non-null count. It can be inferred that every pneumonia row has a bounding box defined, and that no bounding boxes exist for normal patients.

68.38% of the values for x, y, width, and height are missing in the train labels dataset, corresponding to Target 0 (no lung opacity).

Checking class distribution in Detailed class info dataset

Together, "No Lung Opacity / Not Normal" and "Normal" account for the same percentage (68.39%) as the missing target-window values in the class details information.

In the train set, the percent of data with pneumonia is therefore 31.61%.

The target has two classes, 0 and 1, namely Normal and Pneumonia.

Merging train labels and Detailed class info datasets to get more insights
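One way to perform the merge is a left join on patientId, so every bounding-box row keeps its class label (shown here on tiny illustrative frames with the same column names as the CSVs):

```python
import pandas as pd

labels = pd.DataFrame({
    "patientId": ["p1", "p2", "p2"],
    "Target": [0, 1, 1],
})
class_info = pd.DataFrame({
    "patientId": ["p1", "p2"],
    "class": ["Normal", "Lung Opacity"],
})

# drop_duplicates() guards against class_info repeating a patientId,
# which would otherwise multiply rows in the merge.
train = labels.merge(class_info.drop_duplicates(), on="patientId", how="left")
```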

Exploring DICOM image files - Reading training & test files

Extracting a single image and processing DICOM information
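A sketch of reading one file; pydicom is imported lazily inside the helper so it can be defined even before the package is installed (`dcmread` is pydicom's standard entry point):

```python
def read_dicom(path):
    """Load a DICOM file and return (metadata dataset, pixel array)."""
    import pydicom  # lazy import: only needed when the helper is called
    ds = pydicom.dcmread(path)
    return ds, ds.pixel_array
```

Printing the returned dataset dumps all its tags, which is how the useful metadata fields below can be discovered.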

It is observed that some useful information with predictive value is available in the DICOM metadata, for example:

Patient sex, Patient age, Modality, Body part examined, View position, Rows & Columns, Pixel Spacing

Plotting DICOM images with Target = 1

The next step is to display the images with the bounding boxes superimposed. For this, the whole dataset with Target = 1 has to be parsed, and all box coordinates marking a lung opacity on the same image have to be gathered.

For some of the images with Target=1, we could see multiple areas (boxes/rectangles) with Lung Opacity.
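The overlay can be sketched with matplotlib's Rectangle patches; here a random array stands in for a 1024×1024 radiograph, and the box list uses the CSV's (x, y, width, height) format with illustrative values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe outside a notebook
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

img = np.random.rand(1024, 1024)                       # stand-in for a pixel_array
boxes = [(264, 152, 213, 379), (562, 152, 256, 453)]   # illustrative boxes

fig, ax = plt.subplots(figsize=(6, 6))
ax.imshow(img, cmap="bone")
for x, y, w, h in boxes:
    # One rectangle per opacity region for this patientId
    ax.add_patch(patches.Rectangle((x, y), w, h,
                                   linewidth=2, edgecolor="r", facecolor="none"))
plt.close(fig)
```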

Plotting DICOM images with Target = 0

Adding metadata information from the DICOM data to the train and test datasets
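A field-extractor sketch: it reads the attributes listed above from a loaded DICOM dataset via getattr, so absent tags simply become None (the helper name is ours):

```python
DICOM_FIELDS = ["PatientSex", "PatientAge", "Modality", "BodyPartExamined",
                "ViewPosition", "Rows", "Columns", "PixelSpacing"]

def extract_meta(ds):
    """Return the useful DICOM tags as a plain dict (missing tags -> None)."""
    return {f: getattr(ds, f, None) for f in DICOM_FIELDS}
```

Applying `extract_meta` to every file and turning the resulting dicts into a DataFrame yields the extra columns analyzed below.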

Both AP and PA view positions are present in the data. The meanings of these view positions are:

AP - Anterior/Posterior; PA - Posterior/Anterior.


Test dataset : Checking the distribution of AP and PA positions for the test set

Both train and test contain only the WSD Conversion Type. This Conversion Type means WSD: Workstation.

Only one image size (Rows × Columns = 1024 × 1024) is present in both train and test.

Even though the images are all the same size, the anatomy occupies a different area in each image, i.e., not all images fill the frame in the same way. The brightness also differs between images.

Distribution of patient age for the test data set

Distribution of patient sex for the test data. We can also see that a few ages in the dataset are recorded as 412, which is clearly incorrect data.

Conclusion

After exploring both the tabular and DICOM data, we were able to:

  1. discover duplications in the tabular data
  2. explore the DICOM images
  3. extract meta information from the DICOM data
  4. add features to the tabular data from the DICOM metadata
  5. further analyze the distribution of the data with the newly added features

All these findings are useful for building a model.

Preprocess the dataset for model input

Let us print the image with the maximum number of bounding boxes
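Finding that patient is a groupby-count over the positive rows; each Target = 1 row is one box (illustrative data below):

```python
import pandas as pd

labels = pd.DataFrame({
    "patientId": ["p1", "p2", "p2", "p2", "p3"],
    "Target":    [0, 1, 1, 1, 1],
})

box_counts = (labels[labels["Target"] == 1]
              .groupby("patientId").size()
              .sort_values(ascending=False))
busiest_patient = box_counts.index[0]  # patient with the most boxes
```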

We need to identify the distribution (pneumonia vs. normal) of the data used. We consider a portion of the dataset for training (2000 samples) and hold out 25% of it for validation.
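A plain-pandas sketch of that subset-and-split step (the 2000-row subset is shrunk to 100 rows here; the 25% validation fraction matches the one stated above):

```python
import pandas as pd

# Stand-in for the full dataset; in the notebook this would be the merged table.
data = pd.DataFrame({"patientId": [f"p{i}" for i in range(100)],
                     "Target": [i % 2 for i in range(100)]})

subset = data.sample(n=100, random_state=42)       # notebook would use n=2000
valid = subset.sample(frac=0.25, random_state=42)  # 25% held out for validation
train = subset.drop(valid.index)                   # remaining 75% for training
```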

The distribution of the data in both train and validation is good. The distribution of each label across train and test is also balanced enough to proceed.

Freeze the initial layers and train only the top 3 layers, so the model can adapt to the features of our images.
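The freezing step can be sketched framework-agnostically; any Keras-style model exposing a `.layers` list whose layers carry a `trainable` flag works, and the helper name is ours:

```python
def freeze_all_but_top(model, n_trainable=3):
    """Freeze every layer except the last n_trainable (Keras-style API assumed)."""
    for layer in model.layers[:-n_trainable]:
        layer.trainable = False   # frozen: weights keep their pretrained values
    for layer in model.layers[-n_trainable:]:
        layer.trainable = True    # only these layers update during fine-tuning
    return model
```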

Model Evaluation with transfer learning

Calculate precision, recall, and F1-score for the classification, and IoU for the bounding box predictions.
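IoU for two boxes in the CSV's (x, y, width, height) format can be computed as below; F1 then follows from precision and recall in the usual way:

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle (clamped to zero when the boxes do not overlap)
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```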